1 Load packages

library(MASS)       # ginv -- coefficient estimation
library(splines)    # ns, bs -- spline curves
library(multcomp)   # glht -- linear hypotheses
library(edgeR)      # cpm, etc -- RNA-Seq normalization
library(limma)      # lmFit, etc -- fitting many models
library(tidyverse)  # working with data frames, plotting

Much of what we will be using is built into R without loading any packages.

2 Vectors and matrices

2.1 Vector operations

a <- c(3,4)
b <- c(5,6)

length(a)
## [1] 2

R performs operations elementwise:

a + b
## [1]  8 10
a * b
## [1] 15 24

We will be using the dot product a lot. This is:

sum(a*b)
## [1] 39
t(a) %*% b
##      [,1]
## [1,]   39

The geometric length of a vector is (by Pythagoras):

sqrt(sum(a*a))
## [1] 5

2.2 Matrix operations

We can create a matrix with matrix, rbind (row bind), or cbind (column bind).

matrix(c(1,2,3,4), nrow=2, ncol=2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
rbind(c(1,3), c(2,4))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
cbind(c(1,2), c(3,4))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
X <- rbind(
    c(1,0),
    c(1,0),
    c(1,1),
    c(1,1))
X
##      [,1] [,2]
## [1,]    1    0
## [2,]    1    0
## [3,]    1    1
## [4,]    1    1
class(X)
## [1] "matrix"

The matrix transpose is obtained with t.

t(X)
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    0    0    1    1

Matrix multiplication is performed with %*%. The dot product of each row of the left-hand matrix with each column of the right-hand matrix is calculated. %*% treats a vector as either a single-column or single-row matrix, whichever makes sense for the multiplication. Actually all we need today is to multiply a matrix by a vector, in which case we get the dot product of each row of the matrix with the vector.

X %*% a
##      [,1]
## [1,]    3
## [2,]    3
## [3,]    7
## [4,]    7
as.vector(X %*% a)
## [1] 3 3 7 7

2.3 Challenge - use dot products to calculate quantities

The following dot product is an elaborate way to retrieve x[2]:

x <- c(10,20,30,40)
weights <- c(0,1,0,0)       # <-- modify this line
sum(weights*x)

Modify weights in the above to calculate different quantities:

A. x[3]-x[2]

B. The mean of all four values.

3 Single numerical predictor

The age (year) and height (cm) of 10 people have been measured. We want a model that can predict height based on age.

people <- read_csv(
    "age, height
      10,    131
      14,    147
      16,    161
       9,    136
      16,    170
      15,    160
      15,    153
      21,    187
       9,    145
      21,    195")

ggplot(people, aes(x=age, y=height)) + geom_point()

fit <- lm(height ~ age, data=people)

fit
## 
## Call:
## lm(formula = height ~ age, data = people)
## 
## Coefficients:
## (Intercept)          age  
##      92.612        4.513

Coefficients are extracted with coef:

coef(fit)
## (Intercept)         age 
##   92.611502    4.512911

The residual standard deviation is extracted with sigma:

sigma(fit)
## [1] 7.263536

Behind the scenes a matrix of predictors has been produced from the mysterious notation ~ age. We can examine it explicitly:

model.matrix(fit)

model.matrix can be used without first calling lm.

model.matrix(~ age, data=people)
##    (Intercept) age
## 1            1  10
## 2            1  14
## 3            1  16
## 4            1   9
## 5            1  16
## 6            1  15
## 7            1  15
## 8            1  21
## 9            1   9
## 10           1  21
## attr(,"assign")
## [1] 0 1

n=10 observations minus p=2 columns in the model matrix leaves 8 residual degrees of freedom:

df.residual(fit)
## [1] 8

3.1 Prediction

predict predicts. By default it produces predictions on the original dataset.

predict(fit)
##        1        2        3        4        5        6        7        8 
## 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826 
##        9       10 
## 133.2277 187.3826
predict(fit, interval="confidence")
##         fit      lwr      upr
## 1  137.7406 129.8100 145.6712
## 2  155.7923 150.4399 161.1446
## 3  164.8181 159.2250 170.4111
## 4  133.2277 124.3009 142.1545
## 5  164.8181 159.2250 170.4111
## 6  160.3052 154.9836 165.6267
## 7  160.3052 154.9836 165.6267
## 8  187.3826 177.6105 197.1547
## 9  133.2277 124.3009 142.1545
## 10 187.3826 177.6105 197.1547

We can also calculate predictions manually.

# Prediction for a 15-year old
x <- c(1, 15)
beta <- coef(fit)
sum(x * beta)
## [1] 160.3052
# Prediction for all original data
X <- model.matrix(fit)
as.vector( X %*% beta )
##  [1] 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052
##  [8] 187.3826 133.2277 187.3826
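The confidence intervals that predict produces can also be reproduced by hand from the covariance matrix of the coefficients, vcov(fit). A sketch, with the people data re-entered inline so the block stands alone:

```r
# Sketch: a confidence interval for one prediction, computed by hand.
people <- data.frame(
    age    = c(10, 14, 16,  9, 16, 15, 15, 21,  9, 21),
    height = c(131,147,161,136,170,160,153,187,145,195))
fit <- lm(height ~ age, data=people)

x0  <- c(1, 15)                                   # predictors for a 15-year-old
est <- sum(x0 * coef(fit))                        # 160.3052, as before
se  <- drop(sqrt( t(x0) %*% vcov(fit) %*% x0 ))   # standard error of the estimate
est + c(lower=-1, upper=1) * qt(0.975, df.residual(fit)) * se
```

This should match the rows for the 15-year-olds in the interval="confidence" output above.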

predict can be used with new data.

new_people <- tibble(age=5:25)
predict(fit, new_people)
##        1        2        3        4        5        6        7        8 
## 115.1761 119.6890 124.2019 128.7148 133.2277 137.7406 142.2535 146.7664 
##        9       10       11       12       13       14       15       16 
## 151.2793 155.7923 160.3052 164.8181 169.3310 173.8439 178.3568 182.8697 
##       17       18       19       20       21 
## 187.3826 191.8955 196.4085 200.9214 205.4343
new_predictions <- cbind(
    new_people, 
    predict(fit, new_people, interval="confidence"))

ggplot() +
    geom_ribbon(aes(x=age, ymin=lwr, ymax=upr), data=new_predictions, fill="grey") +
    geom_line(aes(x=age, y=fit), data=new_predictions, color="blue") +
    geom_point(aes(x=age, y=height), data=people) +
    labs(y="height (cm)", x="age (year)", 
         subtitle="Ribbon shows 95% confidence interval of the model")

If you have ever used geom_smooth, it should now be a little less mysterious.

ggplot(people, aes(x=age, y=height)) + geom_smooth(method="lm") + geom_point()

3.2 Residuals

The residuals are the differences between the actual and predicted values.

residuals(fit)
##          1          2          3          4          5          6 
## -6.7406103 -8.7922535 -3.8180751  2.7723005  5.1819249 -0.3051643 
##          7          8          9         10 
## -7.3051643 -0.3826291 11.7723005  7.6173709
plot(predict(fit), residuals(fit))

Residuals should be close to normally distributed.

qqnorm(residuals(fit))
qqline(residuals(fit))

Ideally the points would lie on the line. Are they far enough away to worry about? We can simulate some data to judge against. Try this several times:

sim <- rnorm(10, mean=0, sd=sigma(fit))
qqnorm(sim)
qqline(sim)

plot(fit) produces a series of more sophisticated diagnostic plots.

plot(fit)

4 Single factor predictor, two levels

Consider a simple experiment where some outcome is measured for an untreated and a treated group. This can be viewed as a one-way ANalysis Of VAriance (ANOVA) experiment. (This is one of two senses in which the term ANOVA will be used today.)

outcomes <- read_csv(
       "group, outcome
    untreated,  4.98
    untreated,  5.17
    untreated,  5.66
    untreated,  4.87
      treated,  8.07
      treated, 11.02
      treated,  9.91")

outcomes$group <- factor(outcomes$group, c("untreated", "treated"))
outfit <- lm(outcome ~ group, data=outcomes)
outfit
## 
## Call:
## lm(formula = outcome ~ group, data = outcomes)
## 
## Coefficients:
##  (Intercept)  grouptreated  
##        5.170         4.497
df.residual(outfit)
## [1] 5
sigma(outfit)
## [1] 0.9804353
model.matrix(outfit)
##   (Intercept) grouptreated
## 1           1            0
## 2           1            0
## 3           1            0
## 4           1            0
## 5           1            1
## 6           1            1
## 7           1            1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

4.1 How coefficients are estimated

Coefficients are estimated from responses by multiplying by the “Moore-Penrose generalized inverse” of X. It can be useful to examine this to work out exactly what a fit is doing. Each row shows how the corresponding coefficient is estimated.

X <- model.matrix(outfit)
y <- outcomes$outcome
round(ginv(X), 3)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,]  0.25  0.25  0.25  0.25 0.000 0.000 0.000
## [2,] -0.25 -0.25 -0.25 -0.25 0.333 0.333 0.333
ginv(X) %*% y
##          [,1]
## [1,] 5.170000
## [2,] 4.496667

Here we can see the first coefficient is the average of the “untreated” samples, and the second is the average of the “treated” samples minus the average of the “untreated” samples.

( y contains noise, assumed to be identically normally distributed for each observation. Transformation of this noise by ginv(X) tells us the distribution of errors in the coefficients (see vcov()). This can be further propagated to give the distribution of errors in predictions, and in other linear combinations of coefficients. )
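As a sketch of this, the coefficient covariance matrix reported by vcov( ) can be reconstructed from ginv(X) (the outcomes data is re-entered inline so the block stands alone):

```r
library(MASS)

# Re-enter the outcomes data so this block is self-contained
outcomes <- data.frame(
    group = factor(rep(c("untreated","treated"), c(4,3)),
                   levels=c("untreated","treated")),
    outcome = c(4.98, 5.17, 5.66, 4.87, 8.07, 11.02, 9.91))
outfit <- lm(outcome ~ group, data=outcomes)

G <- ginv(model.matrix(outfit))

# Each observation's noise has variance sigma^2, independently.
# Multiplying by G propagates this into the coefficient covariance matrix:
sigma(outfit)^2 * (G %*% t(G))
vcov(outfit)       # should agree
```

The square roots of the diagonal of this matrix are the coefficient standard errors reported by summary( ).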

4.2 Challenge - the meanings of coefficients

We now consider the formula outcome ~ 0 + group.

Examine the model matrix that will be used:

model.matrix(outcome ~ 0 + group, data=outcomes)

  1. What column has been removed because 0 + was used?

  2. R has responded to this by being a bit clever when representing the factor. What column has been added?

  3. The mean of the untreated group is 5.2, the mean of the treated group is 9.7, and the difference between them is 4.5. Without using lm, what values should the coefficients have to best fit the data?

Now perform the actual linear model fit:

outfit2 <- lm(outcome ~ 0 + group, data=outcomes)

  1. Using sigma, does the new model fit the data better or worse than the original?

4.3 Testing a hypothesis

Besides describing data with categorical predictors, the term ANOVA also refers to the use of the F test. Significance is judged by comparing the Residual Sums of Squares of two models. We fit a model representing a null hypothesis. This model formula must nest within our original model formula: any prediction it can make must also be a prediction the original model formula could make. We compare the models using the anova function.

outfit0 <- lm(outcome ~ 1, data=outcomes)

anova(outfit0, outfit)
## Analysis of Variance Table
## 
## Model 1: outcome ~ 1
## Model 2: outcome ~ group
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1      6 39.469                               
## 2      5  4.806  1    34.663 36.06 0.001839 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Warning: This is not the only way to use the anova( ) function, but I think it is the safest way. Once we start using multiple predictors, the meaning of the output from anova with a single model is likely not quite what you want, so read the documentation carefully. The aov( ) function also has traps for the unwary. Use lm( ), and anova( ) with two nested models as in this document, and the meaning should be as you expect.

summary( ) also outputs p-values. It outputs too many p-values: summary( ) doesn’t respect the hierarchy of terms in the model, and the p-value for dropping the intercept is nonsense. The p-values are based on a t statistic, where F=t^2.

summary(outfit)
## 
## Call:
## lm(formula = outcome ~ group, data = outcomes)
## 
## Residuals:
##          1          2          3          4          5          6 
## -1.900e-01  1.096e-15  4.900e-01 -3.000e-01 -1.597e+00  1.353e+00 
##          7 
##  2.433e-01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.1700     0.4902  10.546 0.000132 ***
## grouptreated   4.4967     0.7488   6.005 0.001839 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9804 on 5 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8539 
## F-statistic: 36.06 on 1 and 5 DF,  p-value: 0.001839

confint( ) tells us not only that the difference between groups is non-zero but places a confidence interval on the difference. If the p-value were 0.05, the confidence interval would just touch zero. Whenever we reject a hypothesis that a single coefficient is zero, we may also conclude that we know its sign.

confint(outfit)
##                 2.5 %   97.5 %
## (Intercept)  3.909855 6.430145
## grouptreated 2.571764 6.421569

These results exactly match those of a t-test.

t.test(outcomes$outcome[5:7], outcomes$outcome[1:4], var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  outcomes$outcome[5:7] and outcomes$outcome[1:4]
## t = 6.005, df = 5, p-value = 0.001839
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.571764 6.421569
## sample estimates:
## mean of x mean of y 
##  9.666667  5.170000

4.4 Challenge - does height change with age?

Return to the people dataset.

  1. Can we reject the hypothesis that height is unrelated to age?

  2. Compare the result to the outcome of a correlation test using cor.test( ).

  3. What is the 95% confidence interval on the slope, in cm per year?

5 Multiple factors, many levels

Particle sizes of PVC plastic produced by a machine are measured. The machine is operated by three different people, and eight different batches of resin are used. Two measurements are made for each combination of these two experimental factors.

(This example is adapted from a data-set in the faraway package.)

pvc <- read_csv("r-linear-files/pvc.csv")
## Parsed with column specification:
## cols(
##   psize = col_double(),
##   operator = col_character(),
##   resin = col_character()
## )
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)

ggplot(pvc, aes(x=resin, y=psize)) + geom_point() + facet_grid(~operator)

5.1 Main effects

pvcfit1 <- lm(psize ~ operator + resin, data=pvc)
summary(pvcfit1)
## 
## Call:
## lm(formula = psize ~ operator + resin, data = pvc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9500 -0.6125 -0.0167  0.4063  3.6833 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.2396     0.5226  69.345  < 2e-16 ***
## operatorBob   -0.2625     0.4048  -0.648  0.52059    
## operatorCarl  -1.5063     0.4048  -3.721  0.00064 ***
## resinR2       -1.0333     0.6610  -1.563  0.12630    
## resinR3       -5.8000     0.6610  -8.774 1.13e-10 ***
## resinR4       -6.1833     0.6610  -9.354 2.11e-11 ***
## resinR5       -4.8000     0.6610  -7.261 1.09e-08 ***
## resinR6       -5.4500     0.6610  -8.245 5.46e-10 ***
## resinR7       -2.9167     0.6610  -4.412 8.16e-05 ***
## resinR8       -0.1833     0.6610  -0.277  0.78302    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 38 degrees of freedom
## Multiple R-squared:  0.8595, Adjusted R-squared:  0.8262 
## F-statistic: 25.82 on 9 and 38 DF,  p-value: 1.474e-13
confint(pvcfit1)
##                  2.5 %     97.5 %
## (Intercept)  35.181635 37.2975319
## operatorBob  -1.081983  0.5569834
## operatorCarl -2.325733 -0.6867666
## resinR2      -2.371544  0.3048775
## resinR3      -7.138211 -4.4617892
## resinR4      -7.521544 -4.8451225
## resinR5      -6.138211 -3.4617892
## resinR6      -6.788211 -4.1117892
## resinR7      -4.254877 -1.5784559
## resinR8      -1.521544  1.1548775

This model assumes the influence of the two factors is additive; the model only contains the “main effects”. The meanings of the coefficients are:

  • “(Intercept)” is particle size for Alice and R1
  • “operatorBob” is particle size for Bob relative to Alice
  • “operatorCarl” is particle size for Carl relative to Alice
  • “resinR2” is particle size for R2 relative to R1
  • “resinR3” is particle size for R3 relative to R1
  • (etc)
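To sketch how these additive coefficients combine, here is a tiny hypothetical two-factor example, constructed to be exactly additive:

```r
# Hypothetical, exactly additive data: operator B adds 2, resin R2 adds 5
dat <- expand.grid(
    operator = factor(c("A","B")),
    resin    = factor(c("R1","R2")))
dat$y <- c(10, 12, 15, 17)

fit <- lm(y ~ operator + resin, data=dat)
coef(fit)
# (Intercept) = 10 (A with R1), operatorB = 2, resinR2 = 5

# The prediction for B with R2 is the sum of all three coefficients:
sum(coef(fit))
```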

We can use anova( ) to test if there is evidence either of these main effects is important. For example, to test if there is evidence that the operator is important to the outcome we can test pvcfit1 against a model in which operator is dropped:

pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
## 
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1     40 70.533                              
## 2     38 49.815  2    20.718 7.902 0.00135 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.2 Interactions

We can ask if there is any interaction between the two factors. For example Carl might produce particularly small particles with R5. An additive model doesn’t allow for this.

pvcfit2 <- lm(psize ~ operator + resin + operator:resin, data=pvc)
# or
pvcfit2 <- lm(psize ~ operator*resin, data=pvc)

pvcfit2
## 
## Call:
## lm(formula = psize ~ operator * resin, data = pvc)
## 
## Coefficients:
##          (Intercept)           operatorBob          operatorCarl  
##                36.25                 -0.85                 -0.95  
##              resinR2               resinR3               resinR4  
##                -1.10                 -5.55                 -6.55  
##              resinR5               resinR6               resinR7  
##                -4.40                 -6.05                 -3.35  
##              resinR8   operatorBob:resinR2  operatorCarl:resinR2  
##                 0.55                  1.05                 -0.85  
##  operatorBob:resinR3  operatorCarl:resinR3   operatorBob:resinR4  
##                -0.20                 -0.55                  1.20  
## operatorCarl:resinR4   operatorBob:resinR5  operatorCarl:resinR5  
##                -0.10                  0.40                 -1.60  
##  operatorBob:resinR6  operatorCarl:resinR6   operatorBob:resinR7  
##                 1.30                  0.50                  0.45  
## operatorCarl:resinR7   operatorBob:resinR8  operatorCarl:resinR8  
##                 0.85                  0.50                 -2.70

This model allows for interactions between the two factors. There are enough predictors in the model matrix that each combination of levels can take on a distinct value. So we now have

  • “(Intercept)” is particle size for Alice and R1
  • “operatorBob” is particle size for Bob relative to Alice, for R1
  • “operatorCarl” is particle size for Carl relative to Alice, for R1
  • “resinR2” is particle size for R2 relative to R1, for Alice
  • (etc)
  • “operatorBob:resinR2” is particle size for Bob and R2, relative to (Intercept)+operatorBob+resinR2.
  • (etc)

anova(pvcfit1, pvcfit2)
## Analysis of Variance Table
## 
## Model 1: psize ~ operator + resin
## Model 2: psize ~ operator * resin
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     38 49.815                           
## 2     24 35.480 14    14.335 0.6926 0.7599
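To sketch what an interaction coefficient means, here is the same kind of tiny hypothetical example, now with one non-additive cell:

```r
# Hypothetical data where B with R2 is 3 higher than additivity predicts
dat <- expand.grid(
    operator = factor(c("A","B")),
    resin    = factor(c("R1","R2")))
dat$y <- c(10, 12, 15, 20)

fit <- lm(y ~ operator * resin, data=dat)
coef(fit)
# operatorB:resinR2 = 3, the departure from additivity

# With the interaction, each cell is fitted exactly, e.g. B with R2:
sum(coef(fit))
```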

5.3 Contrasts and confidence intervals

anova( ) lets us test if a particular factor or interaction is needed at all, and summary( ) allows us to see if any levels of a factor differ from the first level. However we may wish to perform different comparisons of the levels of a factor – this is called a “contrast”. We might also want to test some more complicated combination of coefficients such as a difference between two hypothetical individuals. In general this is called a “linear hypothesis” or a “general linear hypothesis”.

Say we want to compare Bob and Carl’s particle sizes. We will use the pvcfit1 model.

coef(pvcfit1)
##  (Intercept)  operatorBob operatorCarl      resinR2      resinR3 
##   36.2395833   -0.2625000   -1.5062500   -1.0333333   -5.8000000 
##      resinR4      resinR5      resinR6      resinR7      resinR8 
##   -6.1833333   -4.8000000   -5.4500000   -2.9166667   -0.1833333
K <- rbind(Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))

K %*% coef(pvcfit1)
##                 [,1]
## Carl_vs_Bob -1.24375

This is the estimated difference in particle size between Carl and Bob, but can we trust it? The glht function from multcomp can tell us. GLHT stands for General Linear Hypothesis Test.

library(multcomp)

result <- glht(pvcfit1, K)
result
## 
##   General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate
## Carl_vs_Bob == 0   -1.244
summary(result)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)   
## Carl_vs_Bob == 0  -1.2437     0.4048  -3.072  0.00391 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Quantile = 2.0244
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                  Estimate lwr     upr    
## Carl_vs_Bob == 0 -1.2437  -2.0632 -0.4243

glht can test multiple hypotheses at once. By default it applies a multiple testing correction when doing so. This is a generalization of Tukey’s Honestly Significant Differences.

K <- rbind(
    Bob_vs_Alice  = c(0,  1,0, 0,0,0,0,0,0,0),
    Carl_vs_Alice = c(0,  0,1, 0,0,0,0,0,0,0),
    Carl_vs_Bob   = c(0, -1,1, 0,0,0,0,0,0,0))
result <- glht(pvcfit1, K)
summary(result)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)   
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648   0.7943   
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721   0.0018 **
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072   0.0108 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Quantile = 2.4379
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                    Estimate lwr     upr    
## Bob_vs_Alice == 0  -0.2625  -1.2494  0.7244
## Carl_vs_Alice == 0 -1.5063  -2.4931 -0.5194
## Carl_vs_Bob == 0   -1.2437  -2.2306 -0.2569
plot(result)

We can also turn off the multiple testing correction.

summary(result, test=adjusted("none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)    
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648  0.52059    
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721  0.00064 ***
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072  0.00391 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

A reasonable compromise between these extremes is Benjamini and Hochberg’s False Discovery Rate (FDR) correction.

summary(result, test=adjusted("fdr"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
## 
## Linear Hypotheses:
##                    Estimate Std. Error t value Pr(>|t|)   
## Bob_vs_Alice == 0   -0.2625     0.4048  -0.648  0.52059   
## Carl_vs_Alice == 0  -1.5063     0.4048  -3.721  0.00192 **
## Carl_vs_Bob == 0    -1.2437     0.4048  -3.072  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- fdr method)

Finally, we can ask if any of the linear combinations is non-zero, i.e. whether the model with all three constraints applied can be rejected. This is equivalent to the anova( ) tests we performed earlier. (Note that while we have three constraints, the degrees of freedom reduction is 2, as any 2 of the constraints are sufficient. This makes me uneasy as it relies on numerical accuracy; it is better to just use any two of the constraints.)

summary(result, test=Ftest())
## 
##   General Linear Hypotheses
## 
## Linear Hypotheses:
##                    Estimate
## Bob_vs_Alice == 0   -0.2625
## Carl_vs_Alice == 0  -1.5063
## Carl_vs_Bob == 0    -1.2437
## 
## Global Test:
##       F DF1 DF2  Pr(>F)
## 1 7.902   2  38 0.00135
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
## 
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1     40 70.533                              
## 2     38 49.815  2    20.718 7.902 0.00135 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This demonstrates that the two methods of testing hypotheses, the ANOVA test and general linear hypotheses, are equivalent.

5.4 Heteroscedasticity

ggplot(pvc, aes(x=resin, y=residuals(pvcfit1))) + 
    geom_point() + geom_hline(yintercept=0) + facet_grid(~operator)

Our assumption that the residual noise is identically normally distributed may not hold. Carl’s data seems to have greater standard deviation than Alice’s or Bob’s. When comparing Alice and Bob’s results, including Carl’s data in the model may alter the outcome.

5.5 Challenge - examine contrasts

Using the pvcfit1 model, construct linear hypotheses to see if the effect of:

  1. R8 is different to R4
  2. R2 is different to R1

6 Gene expression example

Tooth growth in mouse embryos is studied using RNA-Seq. The RNA expression levels of several genes are examined in the cells that form the upper and lower first molars, in eight individual mouse embryos that have been dissected after different times of embryo development. The measurements are in terms of “Reads Per Million”, essentially the fraction of RNA in each sample belonging to each gene, times 1 million.

(This data was extracted from ARCHS4 (https://amp.pharm.mssm.edu/archs4/). In the Gene Expression Omnibus it is entry GSE76316. The sample descriptions in GEO seem to be out of order, but reading the associated paper and the genes they talk about I think I have the correct order of samples!)

teeth <- read_csv("r-linear-files/teeth.csv")
## Parsed with column specification:
## cols(
##   sample = col_character(),
##   mouse = col_character(),
##   day = col_double(),
##   tooth = col_character(),
##   gene_ace = col_double(),
##   gene_smoc1 = col_double(),
##   gene_pou3f3 = col_double(),
##   gene_wnt2 = col_double()
## )
teeth$tooth <- factor(teeth$tooth, c("lower","upper"))
teeth$mouse <- factor(teeth$mouse)

It will be convenient to have a quick way to examine different genes and different models with this data.

# A convenience to examine different model fits
more_data <- expand.grid(
    day=seq(14.3,18.2,by=0.01),
    tooth=as_factor(c("lower","upper")))

look <- function(y, fit=NULL) {
    p <- ggplot(teeth,aes(x=day,group=tooth))
    if (!is.null(fit)) {
        more_ci <- cbind(
            more_data, 
            predict(fit, more_data, interval="confidence"))
        p <- p + 
            geom_ribbon(data=more_ci, aes(ymin=lwr,ymax=upr),alpha=0.1) + 
            geom_line(data=more_ci,aes(y=fit,color=tooth))
    }
    p + geom_point(aes(y=y,color=tooth)) +
        labs(y=deparse(substitute(y)))
}

# Try it out
look(teeth$gene_ace)

We could treat day as a categorical variable, as in the previous section. However let us treat it as numerical, and see where that leads.

6.1 Transformation

6.1.1 Ace gene

acefit <- lm(gene_ace ~ tooth + day, data=teeth)

look(teeth$gene_ace, acefit)

Two problems:

  1. The actual data appears to be curved, our straight lines are not a good fit.
  2. The predictions fall below zero, a physical impossibility.

In this case, log transformation of the data will solve both these problems.

log2_acefit <- lm( log2(gene_ace) ~ tooth + day, data=teeth)

look(log2(teeth$gene_ace), log2_acefit)

Various transformations of y are possible. Log transformation is commonly used in the context of gene expression. Square root transformation can also be appropriate with nicely behaved count data (technically, if the errors follow a Poisson distribution). This gene expression data is ultimately count based, but is overdispersed compared to the Poisson distribution so square root transformation isn’t appropriate in this case. The Box-Cox transformations provide a spectrum of further options.
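As a sketch of the Box-Cox idea, the boxcox( ) function in MASS profiles the likelihood over a power parameter lambda, where lambda near 0 suggests a log transformation. Synthetic data is used here, since the right transformation depends on the data at hand:

```r
library(MASS)

# Synthetic data where the true relationship is linear on the log scale
set.seed(1)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3*x + rnorm(50, sd=0.2))

bc <- boxcox(lm(y ~ x), plotit=FALSE)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda    # typically close to 0 here, suggesting a log transformation
```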

6.1.2 Pou3f3 gene

In the case of the Pou3f3 gene, the log transformation is even more important. It looks like gene expression changes at different rates in the upper and lower molars; that is, there is a significant interaction between tooth and day.

pou3f3fit0 <- lm(gene_pou3f3 ~ tooth + day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit0)

pou3f3fit1 <- lm(gene_pou3f3 ~ tooth * day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit1)

anova(pou3f3fit0, pou3f3fit1)
## Analysis of Variance Table
## 
## Model 1: gene_pou3f3 ~ tooth + day
## Model 2: gene_pou3f3 ~ tooth * day
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 4849.7                                  
## 2     12  415.2  1    4434.5 128.15 9.234e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(pou3f3fit1)["toothupper:day",]
##     2.5 %    97.5 % 
## -34.65685 -23.46937

The slopes of the lines confidently differ by at least 23.5 RPM per day.

Examining the residuals reveals a further problem: larger expression values are associated with larger residuals.

look(residuals(pou3f3fit1))

plot(predict(pou3f3fit1), residuals(pou3f3fit1))

qqnorm(residuals(pou3f3fit1))
qqline(residuals(pou3f3fit1))

Log transformation both removes the interaction and makes the residuals more uniform (except for one outlier).

log2_pou3f3fit0 <- lm(log2(gene_pou3f3) ~ tooth + day, data=teeth)
log2_pou3f3fit1 <- lm(log2(gene_pou3f3) ~ tooth * day, data=teeth)

anova(log2_pou3f3fit0, log2_pou3f3fit1)
## Analysis of Variance Table
## 
## Model 1: log2(gene_pou3f3) ~ tooth + day
## Model 2: log2(gene_pou3f3) ~ tooth * day
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     13 0.56714                           
## 2     12 0.56579  1 0.0013576 0.0288 0.8681
confint(log2_pou3f3fit1)["toothupper:day",]
##      2.5 %     97.5 % 
## -0.2225600  0.1903981

The ratio of fold-change-per-day between the upper and lower molars is confidently within 2^-0.22 to 2^0.19 (0.86 to 1.14).

look(log2(teeth$gene_pou3f3), log2_pou3f3fit0)

qqnorm(residuals(log2_pou3f3fit0))
qqline(residuals(log2_pou3f3fit0))

6.2 Curve fitting

6.2.1 Smoc1 gene

log2_smoc1fit <- lm(log2(gene_smoc1) ~ tooth + day, data=teeth)

look(log2(teeth$gene_smoc1), log2_smoc1fit)

In this case, log transformation does not remove the curve. If you think this is a problem for linear models, you are mistaken! With a little feature engineering we can fit a quadratic curve. Calculations can be included in the formula if wrapped in I( ):

curved_fit <- lm(log2(gene_smoc1) ~ tooth + day + I(day^2), data=teeth)
look(log2(teeth$gene_smoc1), curved_fit)

Another way to do this would be to add the column to the data frame:

teeth$day_squared <- teeth$day^2
curved_fit2 <- lm(log2(gene_smoc1) ~ tooth + day + day_squared, data=teeth)

Finally, the poly( ) function can be used in a formula to fit polynomials of arbitrary degree. poly will encode day slightly differently, but produces an equivalent fit.

curved_fit3 <- lm(log2(gene_smoc1) ~ tooth + poly(day,2), data=teeth)
sigma(curved_fit)
## [1] 0.1630425
sigma(curved_fit2)
## [1] 0.1630425
sigma(curved_fit3)
## [1] 0.1630425

poly( ) can also be used to fit higher order polynomials, but these tend to become wobbly and extrapolate poorly. A better option may be to use the ns( ) or bs( ) functions in the splines package, which can be used to fit piecewise polynomial curves (“B-splines”). In particular ns( ) (natural spline) is appealing because it extrapolates beyond the ends only with straight lines. If the data is cyclic (for example cell cycle or circadian time series), sine and cosine terms can be used to fit some number of harmonics from a Fourier series.

library(splines)
spline_fit <- lm(log2(gene_smoc1) ~ tooth * ns(day,3), data=teeth)

look(log2(teeth$gene_smoc1), spline_fit)
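For the cyclic case, sine and cosine terms can be included directly in a formula. A sketch using synthetic data with a 24-hour cycle and a single harmonic:

```r
# Synthetic measurements over two days with a 24-hour cycle
set.seed(1)
hour <- seq(0, 47, by=1)
y <- 5 + 3*sin(2*pi*hour/24) + rnorm(length(hour), sd=0.5)

# One harmonic: a sine and cosine pair at the 24-hour period
cyclic_fit <- lm(y ~ sin(2*pi*hour/24) + cos(2*pi*hour/24))
coef(cyclic_fit)
```

The fitted sine coefficient should approximately recover the amplitude of 3. Further harmonics (period 12 hours, 8 hours, ...) can be added as extra sine and cosine pairs.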

6.3 Day is confounded with mouse

There may be individual differences between mice. We would like to take this into account in our model. In general it is common to include batch effect terms in a model in order to correctly model the data (and improve the significance of results), even if they are not directly of interest.

badfit <- lm(log2(gene_ace) ~ tooth + day + mouse, data=teeth)
summary(badfit)
## 
## Call:
## lm(formula = log2(gene_ace) ~ tooth + day + mouse, data = teeth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20562 -0.02335  0.00000  0.02335  0.20562 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.115920   0.658430 -19.920 2.01e-07 ***
## toothupper   -0.280467   0.070399  -3.984   0.0053 ** 
## day           0.992214   0.040228  24.665 4.59e-08 ***
## mouseind2     0.100073   0.131897   0.759   0.4728    
## mouseind3    -0.198260   0.125612  -1.578   0.1585    
## mouseind4    -0.122056   0.122349  -0.998   0.3517    
## mouseind5    -0.203226   0.122349  -1.661   0.1407    
## mouseind6    -0.009452   0.125612  -0.075   0.9421    
## mouseind7    -0.042441   0.131897  -0.322   0.7570    
## mouseind8           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1408 on 7 degrees of freedom
## Multiple R-squared:  0.9934, Adjusted R-squared:  0.9859 
## F-statistic: 131.9 on 8 and 7 DF,  p-value: 6.131e-07

In this case this is not possible, and R has arbitrarily dropped a predictor from the model. Because each mouse produced data for a single, different day, mouse is confounded with day: day can be constructed as a linear combination of the intercept term and the mouse terms. The model suffers from multicollinearity.

Another example of confounding would be an experiment in which each treatment is done in a separate batch.
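This kind of confounding can be demonstrated with a small simulation, assuming nothing from the teeth data. The treatment and batch variables here are invented, and each treatment is run entirely in its own batch:

```r
# Each treatment is performed in a separate batch, so the treatment and
# batch predictors are perfectly confounded
confounded <- data.frame(
    treatment = factor(rep(c("A","B"), each=4)),
    batch     = factor(rep(c("batch1","batch2"), each=4)))
set.seed(1)
confounded$y <- rnorm(8)

conf_fit <- lm(y ~ treatment + batch, data=confounded)
coef(conf_fit)     # one coefficient is NA, as with mouse above
alias(conf_fit)    # reports which terms are linear combinations of others
```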

Even if predictors are not perfectly multicollinear, correlation between predictors can make their estimates inaccurate. One way to check for this is to attempt to predict each of the predictors with a linear model that uses the remaining predictors (see “Variance Inflation Factor”).
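As a sketch of this check on simulated data, the Variance Inflation Factor of a predictor is 1/(1-R²), where R² comes from regressing that predictor on all of the others:

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd=0.3)   # x2 is strongly correlated with x1
x3 <- rnorm(50)                # x3 is independent of the rest

# VIF for x2: how well can the other predictors predict it?
r2 <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1/(1 - r2)
vif_x2
```

A VIF much larger than 1 (rules of thumb often say above 5 or 10) indicates that the variance of that predictor's coefficient estimate is badly inflated. The vif( ) function in the car package automates this calculation.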

A possible solution to this problem would be to use a “mixed model”, but this is beyond the scope of today’s workshop.

6.4 Challenge - Wnt2 gene

Look at the expression of gene Wnt2 in column gene_wnt2.

  1. Try some different model formulas.

  2. Justify a particular model by rejecting simpler alternatives using anova( ).

7 Testing many genes with limma

In this section we look at fitting the same matrix of predictors X to many different sets of responses y. We will use the package limma from Bioconductor. This is a very brief demonstration, and there is much more to this package. See the excellent usersguide.pdf at https://bioconductor.org/packages/release/bioc/html/limma.html

7.1 Load, normalize, log transform

Actually in the teeth dataset, the expression level of all genes was measured!

counts_df <- read_csv("r-linear-files/teeth-read-counts.csv")
counts <- as.matrix( select(counts_df, -gene) )
rownames(counts) <- counts_df$gene

dim(counts)
## [1] 32544    16
counts[1:5,]
##               ind1lower ind1upper ind2lower ind2upper ind3lower ind3upper
## 0610007P14Rik      1721      1846      1708      1877      1443      1534
## 0610009B22Rik       512       466       554       504       501       498
## 0610009L18Rik        68        71       121       129        74       124
## 0610009O20Rik      2583      2797      2840      2975      2675      2690
## 0610010F05Rik      1079      1246      1262       992      1232       992
##               ind4lower ind4upper ind5lower ind5upper ind6lower ind6upper
## 0610007P14Rik      1389      1439      1699      1881      1697      1716
## 0610009B22Rik       478       427       569       614       577       568
## 0610009L18Rik        83        87        86       125        99       135
## 0610009O20Rik      2429      2313      2693      3093      2854      3066
## 0610010F05Rik      1011       977      1175      1384      1235      1221
##               ind7lower ind7upper ind8lower ind8upper
## 0610007P14Rik      1531      1625      1716      1373
## 0610009B22Rik       587       531       611       453
## 0610009L18Rik       130       132       114        92
## 0610009O20Rik      2487      2904      2713      2455
## 0610010F05Rik      1042       998      1079       882

The column names match our teeth data frame.

teeth$sample
##  [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower"
##  [6] "ind3upper" "ind4lower" "ind4upper" "ind5lower" "ind5upper"
## [11] "ind6lower" "ind6upper" "ind7lower" "ind7upper" "ind8lower"
## [16] "ind8upper"

A usual first step in RNA-Seq analysis is to convert read counts to Reads Per Million, and log2 transform the results. There are some subtleties here which we breeze over lightly: “TMM” normalization is used as a small adjustment to the total number of reads in each sample. A small constant “prior count” is added to the counts to avoid calculating log2(0). The edgeR and limma manuals describe these steps in more detail.

library(edgeR)
library(limma)

dgelist <- calcNormFactors(DGEList(counts))

dgelist$samples
##           group lib.size norm.factors
## ind1lower     1 37902941    0.9941578
## ind1upper     1 39786308    1.0028804
## ind2lower     1 43133654    0.9984159
## ind2upper     1 40974683    0.9950708
## ind3lower     1 41135198    1.0000633
## ind3upper     1 39615728    1.0011691
## ind4lower     1 37762217    0.9974437
## ind4upper     1 36832933    1.0048675
## ind5lower     1 41669201    1.0037162
## ind5upper     1 49717562    0.9967401
## ind6lower     1 44203546    1.0008790
## ind6upper     1 45924751    1.0041814
## ind7lower     1 39920915    1.0071850
## ind7upper     1 42882436    1.0131047
## ind8lower     1 44664204    0.9855638
## ind8upper     1 40780553    0.9948624
log2_cpms <- cpm(dgelist, log=TRUE, prior.count=1)

There is little chance of detecting differential expression in genes with very low read counts. Including these genes will require a larger False Discovery Rate correction, and also confuses limma’s Empirical Bayes parameter estimation. The typical library size in this data set is 40 million reads. Let’s only retain genes with an average of 1 read per sample or more. Remembering also the “prior count” of 1, this gives a cutoff of log2((1+1)/40) = log2(2/40) on the log2 CPM scale.

keep <- rowMeans(log2_cpms) >= log2(2/40)
log2_cpms_filtered <- log2_cpms[keep,]

nrow(log2_cpms)
## [1] 32544
nrow(log2_cpms_filtered)
## [1] 19277

7.2 Fitting a model to and testing each gene

We use limma to fit a linear model to each gene. The same model formula will be used in each case. limma doesn’t automatically convert a formula into a model matrix, so we have to do this step manually. Here I am using a model formula that treats the upper and lower teeth as following a different linear trend over time.

X <- model.matrix(~ tooth * day, data=teeth)
X
##    (Intercept) toothupper  day toothupper:day
## 1            1          0 14.5            0.0
## 2            1          1 14.5           14.5
## 3            1          0 15.0            0.0
## 4            1          1 15.0           15.0
## 5            1          0 15.5            0.0
## 6            1          1 15.5           15.5
## 7            1          0 16.0            0.0
## 8            1          1 16.0           16.0
## 9            1          0 16.5            0.0
## 10           1          1 16.5           16.5
## 11           1          0 17.0            0.0
## 12           1          1 17.0           17.0
## 13           1          0 17.5            0.0
## 14           1          1 17.5           17.5
## 15           1          0 18.0            0.0
## 16           1          1 18.0           18.0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$tooth
## [1] "contr.treatment"
fit <- lmFit(log2_cpms_filtered, X)

class(fit)
## [1] "MArrayLM"
## attr(,"package")
## [1] "limma"
fit$coefficients[1:5,]
##               (Intercept) toothupper         day toothupper:day
## 0610007P14Rik   5.8170982  1.3595981 -0.03252644    -0.08305721
## 0610009B22Rik   3.1457085  0.6435675  0.03623474    -0.04848863
## 0610009L18Rik  -0.8927645  1.5324625  0.12970051    -0.08331755
## 0610009O20Rik   6.6205151  0.2793246 -0.03743937    -0.01505405
## 0610010F05Rik   5.8292931  0.2980921 -0.06411158    -0.02494750

Significance testing in limma is by the use of linear hypotheses (which limma refers to as “contrasts”). A difference between glht and limma’s contrasts.fit is that contrasts.fit expects the linear hypotheses to be the columns of the matrix, rather than the rows.

We will first look for genes where the slope over time is not flat, averaging the lower and upper teeth.

# Lower slope: c(0,0,1,0)
# Upper slope: c(0,0,1,1)

K <- rbind(c(0,0,1,0.5))
cfit <- contrasts.fit(fit, t(K))       #linear hypotheses in columns!
efit <- eBayes(cfit, trend=TRUE)

The call to eBayes does Empirical Bayes squeezing of the residual variance for each gene (see appendix). This is a bit of magic that allows limma to work well with small numbers of samples.

topTable(efit)
##              logFC  AveExpr         t      P.Value    adj.P.Val        B
## Rnf144b  0.6685609 4.483495  33.45929 5.781246e-16 1.114451e-11 26.24848
## Six2    -0.5596098 6.978328 -29.58593 3.841220e-15 2.855688e-11 24.62967
## Tfap2b  -0.6435643 7.468392 -29.15102 4.822298e-15 2.855688e-11 24.43025
## Prom2    0.9146042 3.676520  28.76243 5.925586e-15 2.855688e-11 24.24876
## Ace      0.9838607 2.812857  27.97709 9.061123e-15 3.493426e-11 23.87210
## Stk32a   1.2239529 3.708474  27.51328 1.170771e-14 3.504942e-11 23.64323
## Vldlr    0.4127595 6.814182  27.36374 1.272739e-14 3.504942e-11 23.56839
## Bambi    0.5593308 6.104168  26.37914 2.230451e-14 5.374551e-11 23.06244
## Msx2     0.8107059 6.253309  25.30538 4.209704e-14 9.016717e-11 22.48322
## Igf2bp1 -0.5401323 5.915739 -25.07244 4.848104e-14 9.345691e-11 22.35359

The column adj.P.Val contains FDR adjusted p-values.

all_results <- topTable(efit, n=Inf)

significant <- all_results$adj.P.Val <= 0.05
table(significant)
## significant
## FALSE  TRUE 
## 12434  6843
ggplot(all_results, aes(x=AveExpr, y=logFC)) + 
    geom_point(size=0.1) +
    geom_point(data=all_results[significant,], size=0.1, color="red")

7.3 Relation to lm( ) and glht( )

Let’s look at a specific gene.

rnf144b <- log2_cpms["Rnf144b",]
rnf144b_fit <- lm(rnf144b ~ tooth * day, data=teeth)
look(rnf144b, rnf144b_fit)

We can use the same linear hypothesis with glht. The estimate is the same, but limma has gained some power by shrinking the variance toward the trend line, so limma’s p-value is smaller.

summary( glht(rnf144b_fit, K) )
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = rnf144b ~ tooth * day, data = teeth)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0   0.6686     0.0187   35.76 1.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

7.4 Confidence intervals

Confidence intervals may also be of interest. However, note that these are not adjusted for multiple testing (see appendix).

topTable(efit, confint=0.95)
##              logFC       CI.L       CI.R  AveExpr         t      P.Value
## Rnf144b  0.6685609  0.6261190  0.7110029 4.483495  33.45929 5.781246e-16
## Six2    -0.5596098 -0.5997862 -0.5194334 6.978328 -29.58593 3.841220e-15
## Tfap2b  -0.6435643 -0.6904574 -0.5966711 7.468392 -29.15102 4.822298e-15
## Prom2    0.9146042  0.8470615  0.9821470 3.676520  28.76243 5.925586e-15
## Ace      0.9838607  0.9091639  1.0585576 2.812857  27.97709 9.061123e-15
## Stk32a   1.2239529  1.1294612  1.3184445 3.708474  27.51328 1.170771e-14
## Vldlr    0.4127595  0.3807195  0.4447995 6.814182  27.36374 1.272739e-14
## Bambi    0.5593308  0.5142928  0.6043688 6.104168  26.37914 2.230451e-14
## Msx2     0.8107059  0.7426569  0.8787549 6.253309  25.30538 4.209704e-14
## Igf2bp1 -0.5401323 -0.5858911 -0.4943735 5.915739 -25.07244 4.848104e-14
##            adj.P.Val        B
## Rnf144b 1.114451e-11 26.24848
## Six2    2.855688e-11 24.62967
## Tfap2b  2.855688e-11 24.43025
## Prom2   2.855688e-11 24.24876
## Ace     3.493426e-11 23.87210
## Stk32a  3.504942e-11 23.64323
## Vldlr   3.504942e-11 23.56839
## Bambi   5.374551e-11 23.06244
## Msx2    9.016717e-11 22.48322
## Igf2bp1 9.345691e-11 22.35359

7.5 F test

limma can also test several simultaneous constraints on linear combinations of coefficients. Suppose we want to find any deviation from a constant expression level. We can check for this with:

K2 <- rbind(
    c(0,1,0,0),
    c(0,0,1,0),
    c(0,0,0,1))

cfit2 <- contrasts.fit(fit, t(K2))
efit2 <- eBayes(cfit2, trend=TRUE)
topTable(efit2)
##               Coef1      Coef2       Coef3   AveExpr        F      P.Value
## Pou3f3    4.0115573 -0.3374359 -0.01666389 5.1306000 510.7941 8.429160e-16
## Six2     -0.7932927 -0.6090109  0.09880229 6.9783275 411.1322 4.501191e-15
## Rnf144b  -1.4380281  0.6356437  0.06583438 4.4834946 395.6411 6.051473e-15
## Tfap2b   -4.3112993 -0.7917082  0.29628788 7.4683916 331.2794 2.371974e-14
## Prom2     0.6240851  0.9605544 -0.09190034 3.6765200 323.9055 2.819738e-14
## Nkx2-3  -12.4452233 -0.4514939  0.20547087 0.4715821 280.9029 8.408998e-14
## Vldlr     2.2392893  0.4744148 -0.12331065 6.8141822 270.6352 1.118379e-13
## Ace      -0.4796297  0.9777002  0.01232107 2.8128573 264.9243 1.316682e-13
## Stk32a    0.8910511  1.2630915 -0.07827735 3.7084739 257.2410 1.649128e-13
## Bambi     1.5851348  0.6164069 -0.11415222 6.1041676 244.6511 2.420043e-13
##            adj.P.Val
## Pou3f3  1.624889e-11
## Six2    3.888475e-11
## Rnf144b 3.888475e-11
## Tfap2b  1.087122e-10
## Prom2   1.087122e-10
## Nkx2-3  2.701671e-10
## Vldlr   3.079856e-10
## Ace     3.172710e-10
## Stk32a  3.532248e-10
## Bambi   4.665117e-10

A shortcut would be to use contrasts.fit(fit, coefficients=2:4) here instead, or to specify a set of coefficients directly to topTable( ).

7.6 Challenge - construct some linear hypotheses

Construct and use linear combinations to find genes that:

  1. Differ in slope between lower and upper molars.

  2. Differ in expression on day 16 between the lower and upper molars.

Hint: hypothesis 2 can be viewed as the difference in predictions between two individual samples.

  3. Construct a pair of linear combinations that, when used together in an F test, find genes with non-zero slope in either or both of the lower and upper molars.

8 Appendix

8.1 Empirical Bayes variance squeezing

In limma, Empirical Bayes squeezing of the residual variance acts as though we have some number of extra “prior” observations of the variance. These are also counted as extra degrees of freedom in F tests. The “prior” observations act to squeeze the estimated residual variance toward a trend line that is a function of the average expression level.

efit <- eBayes(cfit, trend=TRUE)

efit$df.prior
## [1] 3.622052
efit$df.residual
##  [1] 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
##  [ reached getOption("max.print") -- omitted 19256 entries ]
efit$df.total
##  [1] 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205
##  [8] 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205
## [15] 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205 15.62205
##  [ reached getOption("max.print") -- omitted 19256 entries ]
plotSA(efit)
points(efit$Amean, efit$s2.post^0.25, col="red", cex=0.2)

The total effective degrees of freedom is the “prior” degrees of freedom plus the normal residual degrees of freedom. As can be seen in the plot, compared to the residual variance (black dots), this produces a posterior residual variance (efit$s2.post, red dots) that is squeezed toward the trend line.

It’s worthwhile checking df.prior when using limma, as a low value may indicate a problem with a dataset.

8.2 False Coverage Rate corrected CIs

We noted the CIs produced by limma were not adjusted for multiple testing. A False Coverage Rate (FCR) corrected CI can be constructed corresponding to a set of genes judged significant. The smaller the selection of genes as a proportion of the whole, the greater the correction required. To ensure a False Coverage Rate of q, we use the confidence interval (1-q*n_genes_selected/n_genes_total)*100%.

all_results <- topTable(efit, n=Inf)
significant <- all_results$adj.P.Val <= 0.05
prop_significant <- mean(significant)
fcr_confint <- 1 - 0.05*prop_significant

all_results <- topTable(efit, confint=fcr_confint, n=Inf)

ggplot(all_results, aes(x=AveExpr, y=logFC)) + 
    geom_point(size=0.1, color="grey") +
    geom_errorbar(data=all_results[significant,], aes(ymin=CI.L, ymax=CI.R), color="red") +
    geom_point(data=all_results[significant,], size=0.1)

The FCR corrected CIs used here have the same q, 0.05, as we used as the cutoff for adj.P.Val. This means they never pass through zero.

I have some further thoughts on this topic, see the Bioconductor package topconfects (https://bioconductor.org/packages/release/bioc/html/topconfects.html).


sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] forcats_0.4.0     stringr_1.4.0     dplyr_0.8.3      
##  [4] purrr_0.3.2       readr_1.3.1       tidyr_1.0.0      
##  [7] tibble_2.1.3      ggplot2_3.2.1     tidyverse_1.2.1  
## [10] edgeR_3.26.8      limma_3.40.6      multcomp_1.4-10  
## [13] TH.data_1.0-10    survival_2.44-1.1 mvtnorm_1.0-11   
## [16] MASS_7.3-51.4    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-38 
##  [5] zoo_1.8-6        assertthat_0.2.1 zeallot_0.1.0    digest_0.6.21   
##  [9] plyr_1.8.4       R6_2.4.0         cellranger_1.1.0 backports_1.1.5 
## [13] evaluate_0.14    httr_1.4.1       pillar_1.4.2     rlang_0.4.0     
## [17] lazyeval_0.2.2   readxl_1.3.1     rstudioapi_0.10  Matrix_1.2-17   
## [21] rmarkdown_1.16   labeling_0.3     munsell_0.5.0    broom_0.5.2     
## [25] compiler_3.6.0   modelr_0.1.5     xfun_0.10        pkgconfig_2.0.3 
## [29] htmltools_0.4.0  tidyselect_0.2.5 codetools_0.2-16 crayon_1.3.4    
## [33] withr_2.1.2      grid_3.6.0       nlme_3.1-141     jsonlite_1.6    
## [37] gtable_0.3.0     lifecycle_0.1.0  magrittr_1.5     scales_1.0.0    
## [41] cli_1.1.0        stringi_1.4.3    reshape2_1.4.3   xml2_1.2.2      
## [45] ellipsis_0.3.0   vctrs_0.2.0      generics_0.0.2   sandwich_2.5-1  
## [49] tools_3.6.0      glue_1.3.1       hms_0.5.1        yaml_2.2.0      
## [53] colorspace_1.4-1 rvest_0.3.4      knitr_1.25       haven_2.1.1
